LLaVA系列：Visual Instruction Tuning

project page, GitHub

Model对比分析

model zoo

LLaVA
- 基于LLaMA-2-7B-Chat, lora 1 epoch
- LLaMA-2-13B-Chat，full_ft 1 epoch
- 基于Vicuna-13B-v1.3，projection 1 epoch, lora 1 epoch，最强
LLaVA-1.5，基于Vicuna-7B/13B-v1.3，LoRA/full-ft LLM
- LoRA版本能跟full-ft打平手
LLaVA-NeXT / LLaVA 1.6，
- 全部full-ft LLM
- 基于Vicuna-7B/13B-1.5, Minstral-7B, Hermes-Yi-34B
- 34B模型版本特别强！

LLaVA的核心

data-efficient, 558K for pretrain, 665K for Instruction Tuning

Pretraining(Alignment)没那么重要，可以轻量级：不需要海量对齐数据，可以冻住LLM只训Connector
Visual Instruction Tuning很重要
可能是因为CLIP Encoder已经在海量数据上对齐了？

简单架构，MLP Connector

切分子图，实现高分辨率：显著降低幻觉！

LLaVA-1.5

LLaVA 1.5 Paper：Improved Baselines with Visual Instruction Tuning

data-efficient, 558K for pretrain, 665K for Instruction Tuning

Pretraining(Alignment)似乎没那么重要，不需要海量对齐数据，而Visual Instruction Tuning很重要

Abalation

Academic Benchmarks需要short-form answer，而LLaVA在这些benchmark上表现不好。做了如下改进：

Response Formatting prompts+ finetune LLM

LLaVA在real-world data下效果好，academic benchmark(short answers)上效果不好
InstructBLIP在academic benchmark上效果好，real-world下效果不好，因为 tends to answer yes for yes/no questions due to the lack of such data in the training distribution

对2023-06-15_InstructBLIP - Towards General-purpose Vision-Language Models with Instruction Tuning的分析：
First, ambiguous prompts on the response format. For example, Q: {Question} A: {Answer}. Such prompts do not clearly indicate the desirable output format.

Second, not finetuning the LLM. The first issue is worsened by InstructBLIP only finetuning the Qformer for instruction-tuning. It requires the Qformer’s visual output tokens to control the length of the LLM’s output to be either long-form or short-form, as in prefix tuning [25], but Qformer may lack the capability of properly doing so.

changing Linear Projection to MLP

在LLaVA-Instruct的基础上，加入academic task oriented data

Scaling：

把CLIP-ViT-L-224换成CLIP-ViT-L-336

LLaVA

LLaVA paper： Visual Instruction Tuning

核心在于利用Caption&BBox，喂给GPT-4，构造了Visual Instruction-Tuning Data

Instruction Tuning Data

从COCO( images + captions + bounding boxes)构建Instruction-following data

use GPT-4 or ChatGPT，输入 Captions/Bounding Boxes，输出
For each type, we first manually design a few examples. They are the only human annotations we have during data collection, and are used as seed examples in in-context-learning to query GPT-4
得到LLaVA-Instruct-158K, including 58K in multi-turn conversations, 23K in detailed description, and 77k in complex reasoning

Model

CLIP ViT-L/24, Linear Layer for projection

Training

Alignment on filtered 505K image-text pairs from CC3M， only tune projection layer

1 epoch

SFT: projection&LLM with LLaVA-Instruct-158K

3 epochs
注意：第一轮prompt中，图&文的顺序是随机的！

Eval

LLaVA-Bench(COCO), 90 questions和LLaVA-Bench(In-the-Wild), 24 images with 60 questions

GPT-4, given textual description(caption) of the image, rates the answer from 1 to 10

Abalations

We tried using the last layer feature from CLIP vision encoder, which yields 89.96% and is 0.96% lower than the feature before the last layer. We hypothesize that this is because CLIP’s last layer features may focus more on global and abstract image properties compared to the layer before it

LLaVA-NeXT / LLaVA-1.6

LLaVA-NeXT(LLaVA 1.6)，improved reasoning, OCR, and world knowledge
论文：improved_llava.pdf，3.4节开始是关于LLaVA-1.6的内容

Compared with LLaVA-1.5, LLaVA-NeXT has several improvements:

Increasing the input image resolution to 4x more pixels. This allows it to grasp more visual details. It supports three aspect ratios, up to 672x672, 336x1344, 1344x336 resolution.
Better visual reasoning and OCR capability with an improved visual instruction tuning data mixture.
Better visual conversation for more scenarios, covering different applications. Better world knowledge and logical reasoning.

Model

It re-uses the pretrained connector of LLaVA-1.5, and still uses less than 1M visual instruction tuning samples. 这一点论文里没说，然而从model zoo看，每个size的model的connector不一样。我理解为由于pretraining阶段的数据集一样，因此第一阶段可以复用同大小model的pretraining。

High-Resolution

切分子图，支持任意分辨率：336 x [(2,2), (1,2), (2,1), (1,3), (3,1), (1,4), (4,1)]，论文附录中还支持了1x5,1x6,2x3

Motivation
When provided with high-resolution images and representations that preserve these details, the model’s capacity to perceive intricate details in an image is significantly improved. It reduces the model hallucination that conjectures the imagined visual content when confronted with low-resolution images.

Abalation证明，高分辨率显著降低了Hallucination！
Pasted image 20240421155254.png

dividing the image into smaller image patches of the resolution that the vision encoder is originally trained for, and encode them independently.
After obtaining the feature maps of individual patches, we then combine them into a single large feature map of the target resolution, and feed that into the LLM
同时，pad and resize the image to a single image of 224x224 , and concatenate it with the high resolution features to provide a global context，Ablation证明有用

实现细节：

predefine a set of resolutions to support up to six grids
Padding & Padding Removal，避免非方形的输入(需要插值)！
换行符，和[Intern-LM-XCompser2-4KHD]一样

Training

第一阶段Alignment Train Connector, 558K
第二阶段 Instruction Tuning Full model, 760K

Ablation

Vicuna-1.5是最强的Base Model，比LLaMA2-Chat强

Data Efficiency：进行data sub-sample之后，仍能有很好的效果，50%的数据也可以有不错的效果

LIMA: less-is-more alignment

通常认为Hallucination来自训练集error，然而，高分辨率显著降低了Hallucination！因此，除了训练集error之外，训练集标注的内容在对应resolution下，无法被模型捕捉到，也可能产生幻觉(相当于模型看不到标注的东西，相当于error)！